
Mapd’O Network metrics exploration
Progress meeting 23.04.2024
Points
Dataset overview
Variable cleaning
Principal Component Analysis
K-means Clustering
Hidden-Markov-Modeling
Next steps
Dataset overview
# plot histograms for all variables
DataExplorer::plot_histogram(data_all, nrow = 4)




The dataset contains only the normalized land-use and lateral-continuity variables (areas normalized by the valley bottom area).
Variable cleaning
Based on high values of correlation and similarities in the PCA, the following variables are removed:
- floodplain_slope, as it is well represented by talweg_slope
- gravel_bars_pc, as it is well represented by active_channel_pc
- water_channel_width, as it is well represented by active_channel_width
- valley_bottom_width, as it is well represented by sum_area
- semi_natural_pc, as it is incorrectly calculated and only represents grassland_pc
- reversible_pc, as it is incorrectly calculated and only represents grassland_pc and crops_pc
- infrastructures_pc, dense_urban_pc, and diffuse_urban_pc, as they are well represented by built_environment_pc
- natural_corridor_width, as it is well represented by connected_corridor_width
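A minimal sketch of how such redundant pairs can be flagged from a correlation matrix. The data frame `metrics`, its synthetic values, and the 0.9 threshold are illustrative assumptions, not the data or cut-off used in the analysis.

```r
# Synthetic stand-in for the metrics table (hypothetical values).
set.seed(1)
n <- 200
talweg_slope <- runif(n, 0, 0.05)
metrics <- data.frame(
  talweg_slope     = talweg_slope,
  floodplain_slope = talweg_slope + rnorm(n, sd = 0.001),  # near-duplicate
  crops_pc         = runif(n, 0, 100)
)

# Absolute pairwise Pearson correlations.
cor_mat <- abs(cor(metrics, use = "pairwise.complete.obs"))

# Keep each pair once (upper triangle) and list those above the threshold.
threshold <- 0.9  # assumed cut-off
idx <- which(upper.tri(cor_mat) & cor_mat > threshold, arr.ind = TRUE)
redundant_pairs <- data.frame(
  var_a = rownames(cor_mat)[idx[, "row"]],
  var_b = colnames(cor_mat)[idx[, "col"]],
  r     = cor_mat[idx]
)
print(redundant_pairs)
```

Which member of a flagged pair to drop remains a judgment call, as in the list above.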

PCA










According to the results, the first four principal components together explain 64.5 % of the variance in the data set. Each of these PCs is described below on the basis of how strongly the individual variables are associated with it, in order to facilitate interpretation.
| PC | Description |
|---|---|
| PC1 | Positive values indicate large rivers in wide valleys with low slopes and low elevations, with a comparatively small riparian corridor and diverse anthropogenic activity in the adjacent areas. Negative values indicate smaller rivers in narrow valleys with higher slopes and elevations, with a greater relative area for the riparian corridor and less activity in the adjacent areas. |
| PC2 | Positive values indicate rather narrow valleys in which most of the space is taken by the water channel, with little space for the connected corridor and crops. Negative values indicate wide valleys with smaller channel width to valley width ratios and larger shares of connected corridor and crops. |
| PC3 | Positive values indicate comparatively large and forested riparian corridors at lower elevations with little grassland and natural open area. Negative values thus indicate comparatively small and unforested riparian corridors at higher elevations with more natural open areas and grasslands. |
| PC4 | Positive values indicate rather small, confined streams with a strong presence of anthropogenic infrastructure. Negative values thus indicate comparatively larger rivers with more space for the active channel and no built/anthropogenic infrastructure in the adjacent zones. |
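The cumulative explained variance and the variable loadings behind the interpretations above can be obtained roughly as follows. This is a sketch with base R `prcomp` on synthetic data; the data frame `metrics` and its values are assumptions, and the package actually used for the PCA is not stated in these notes.

```r
# Synthetic stand-in for the cleaned metrics (hypothetical values).
set.seed(42)
n <- 300
size <- rnorm(n)
metrics <- data.frame(
  active_channel_width     = size + rnorm(n, sd = 0.3),
  sum_area                 = size + rnorm(n, sd = 0.3),
  talweg_slope             = -size + rnorm(n, sd = 0.5),
  built_environment_pc     = rnorm(n),
  connected_corridor_width = rnorm(n)
)

# PCA on centred, unit-variance variables.
pca <- prcomp(metrics, center = TRUE, scale. = TRUE)

# Share of variance per component and cumulative share.
var_share <- pca$sdev^2 / sum(pca$sdev^2)
cum_share <- cumsum(var_share)
print(round(cum_share, 3))

# Loadings of each variable on the first two components.
print(round(pca$rotation[, 1:2], 2))
```

The sign of a principal component is arbitrary, so which end of an axis counts as "positive" depends on the orientation returned by the software.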

















K-means Clustering
K-means is a clustering method that partitions the data into k clusters by searching for cluster centers (centroids) such that the sum of squared distances between each data point and its assigned center is minimized. To apply this method, the number of clusters must be chosen first. For this purpose, 24 different indices were evaluated using the NbClust package. Among all indices:
- 4 proposed 2 as the best number of clusters
- 5 proposed 3 as the best number of clusters
- 2 proposed 4 as the best number of clusters
- 10 proposed 5 as the best number of clusters
- 1 proposed 8 as the best number of clusters
- 1 proposed 9 as the best number of clusters
- 1 proposed 10 as the best number of clusters
According to the majority rule, the best number of clusters is 5.
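The majority vote and the subsequent clustering step can be sketched as follows. The vote counts are those listed above; the matrix `x` is synthetic, and the sketch uses base R `kmeans` only, so it does not reproduce the NbClust run itself.

```r
# Votes per candidate number of clusters, as listed above.
votes <- c("2" = 4, "3" = 5, "4" = 2, "5" = 10, "8" = 1, "9" = 1, "10" = 1)
n_indices <- sum(votes)  # 24 indices in total
best_k <- as.integer(names(votes)[which.max(votes)])

# K-means with the selected k on synthetic, standardised data
# (hypothetical stand-in for the scaled metrics).
set.seed(123)
x <- scale(matrix(rnorm(500 * 4), ncol = 4))
km <- kmeans(x, centers = best_k, nstart = 25, iter.max = 50)

# Quantity minimised by k-means: total within-cluster sum of squares.
print(km$tot.withinss)
print(table(km$cluster))
```

Several random starts (`nstart`) are used because k-means only finds a local optimum that depends on the initial centers.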




Based on the distributions of the variables within each cluster, the main characteristics of the clusters are summarized in the following table:
| Cluster | Name | Derived characteristics |
|---|---|---|
| 1 | Rivers confined by anthropogenized floodplain | Rather confined, lower-elevation rivers with an altered riparian zone including diverse uses such as urban and agricultural infrastructure. |
| 2 | Larger rivers with agricultural landscape | Larger rivers in wide valleys with low slopes and low elevations, with semi-intensive riparian corridor use due to agricultural activity. |
| 3 | Small upstream rivers | Smaller and unforested riparian corridors at higher elevations with more natural open areas and grasslands and less activity in the adjacent areas. |
| 4 | Forested medium-sized rivers | Large and forested riparian corridors at lower elevations with little grassland and natural open area. |
| 5 | Diverse medium-sized and large rivers | Medium-sized and larger streams at lower elevations with different land-use patterns and active channel sizes. |
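One simple way to derive such cluster profiles is to compare per-cluster summaries of each variable. The sketch below uses synthetic data and two clusters only; the data frame `metrics` and all values are assumptions for illustration.

```r
# Synthetic stand-in with two well-separated groups (hypothetical values).
set.seed(7)
metrics <- data.frame(
  talweg_slope         = c(rnorm(50, 0.01, 0.002), rnorm(50, 0.05, 0.01)),
  built_environment_pc = c(rnorm(50, 40, 5), rnorm(50, 5, 2))
)

# Cluster on standardised variables, then attach the labels.
km <- kmeans(scale(metrics), centers = 2, nstart = 10)
metrics$cluster <- factor(km$cluster)

# Median of every variable within each cluster.
cluster_medians <- aggregate(. ~ cluster, data = metrics, FUN = median)
print(cluster_medians)
```

Medians are used here because they are less sensitive to skewed distributions than means; boxplots per cluster show the same information graphically.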

Next steps
- manual clustering
- advance with the dependent mixture / HMM model:
  - multivariate modelling with the depmixS4 package
  - can the river network be modelled through a geographic tree structure rather than a unidirectional chain, e.g. as in Jiang et al. (2019)?
- start learning R Shiny development (e.g. with Lise’s course and ThinkR material)
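For the dependent mixture / HMM step, a multivariate model with depmixS4 could start from something like the sketch below. The data frame `reaches`, the two-state setup, and the Gaussian response families are assumptions for illustration; the sketch treats the reaches as a single upstream-to-downstream chain and does not address the tree-structure question.

```r
set.seed(1)
# Synthetic sequence of reaches ordered upstream to downstream,
# switching between two regimes (hypothetical values).
state <- rep(c(1, 2, 1), times = c(60, 80, 60))
reaches <- data.frame(
  talweg_slope         = rnorm(200, mean = c(0.05, 0.01)[state], sd = 0.005),
  active_channel_width = rnorm(200, mean = c(10, 40)[state], sd = 4)
)

# The model part only runs if depmixS4 is installed.
if (requireNamespace("depmixS4", quietly = TRUE)) {
  # Two Gaussian responses sharing one hidden state sequence.
  mod <- depmixS4::depmix(
    response = list(talweg_slope ~ 1, active_channel_width ~ 1),
    data     = reaches,
    nstates  = 2,
    family   = list(gaussian(), gaussian())
  )
  fitted_mod <- depmixS4::fit(mod, verbose = FALSE)

  # Most likely state per reach (Viterbi decoding).
  decoded <- depmixS4::posterior(fitted_mod, type = "viterbi")
  print(table(decoded$state))
}
```

The number of states would have to be chosen separately, for example by comparing the BIC of models fitted with different `nstates`.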
Manual clustering
# data_all <- network_dgo |> sf::st_drop_geometry()
#
# mapview::mapview(network_dgo)
#
# # grouping rule: classname - variable - sign - value - color
# grouping <- list(classname = "humanimpact",
#                  variable = "dense_urban",
#                  sign = ">=",
#                  value = 0.3,
#                  color = "red")
#
# # look up the comparison operator by its name and apply it to the chosen
# # column; .data[[...]] selects a column by string (dplyr data-masking)
# # NOTE: the first draft compared against 30 while the rule stores 0.3;
# # the unit of the variable (percent vs. fraction) still needs checking
# compare <- match.fun(grouping$sign)
# data_all <- data_all |>
#   dplyr::filter(compare(.data[[grouping$variable]], grouping$value))
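The rule format noted in the comments above (classname, variable, sign, value, color) can also be applied with base R alone. The following is a sketch under assumptions: the rule table `rules`, the class name "agricultural", the thresholds, and the synthetic `reaches` values are all hypothetical.

```r
# Hypothetical rule table: classname - variable - sign - value - color.
rules <- data.frame(
  classname = c("humanimpact", "agricultural"),
  variable  = c("dense_urban_pc", "crops_pc"),
  sign      = c(">=", ">="),
  value     = c(30, 50),
  color     = c("red", "orange"),
  stringsAsFactors = FALSE
)

# Synthetic reaches (hypothetical values, in percent).
reaches <- data.frame(
  dense_urban_pc = c(45, 5, 0, 35),
  crops_pc       = c(10, 70, 5, 60)
)

# Assign the first matching rule to every reach;
# reaches matching no rule stay "unclassified".
reaches$class <- "unclassified"
for (i in seq_len(nrow(rules))) {
  compare <- match.fun(rules$sign[i])
  hit <- compare(reaches[[rules$variable[i]]], rules$value[i])
  unassigned <- reaches$class == "unclassified"
  reaches$class[hit & unassigned] <- rules$classname[i]
}
print(reaches)
```

Because rules are applied in order and earlier matches are kept, the order of the rows in `rules` sets the priority between overlapping classes.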

